Abstract. Transcription regulation is largely governed by the profile and the
dynamics of transcription factors' binding to DNA. Stochastic effects are intrinsic
to this dynamics, and binding to functional sites must be controlled with a certain
specificity for living organisms to be able to elicit specific cellular responses. Specificity
stems here from the interplay between the binding affinity and the cellular abundance of
transcription factor proteins, and the binding of such proteins to DNA is thus controlled
by their chemical potential.

We combine large-scale protein abundance data in the budding yeast with binding 
affinities for all transcription factors with known DNA binding site sequences to 
assess the behavior of their chemical potentials. A sizable fraction of transcription 
factors is apparently bound non-specifically to DNA and the observed abundances are 
marginally sufficient to ensure high occupations of the functional sites. We argue that a 
biological cause of this feature is related to its noise-filtering consequences: abundances
below physiological levels do not yield significant binding of functional targets, and
mis-expressions of regulated genes are thus tamed.
1. Introduction

A major determinant in transcription regulation is the pattern of transcription
factor proteins (TFs) bound in the physical proximity of the transcribed genomic
locus [1, 2, 3]. Intense activity is currently carried out to identify transcriptional
regulatory networks [4, 5, 6], their topology [7, 8, 9, 10] and the signs and strengths of the
interactions [11]. Specificity is an obvious need in transcription regulation: functional
binding sites ought to be sufficiently low in energy compared to typical sequences in
the rest of the genome (the so-called background). This energetic constraint should be
coupled with its kinetic counterpart: the TF should be able to rapidly find its functional
targets. Existing evidence points at a search taking place via 1D sliding along the DNA,
alternated with 3D excursions [12, 13]. The TF is kept along the DNA by non-specific
electrostatic interactions, recently characterized experimentally [14, 15].

Two quantitative variables govern the binding of TFs to DNA: their cellular 
abundance and the affinity between the amino acids forming their binding domains and 
the various possible stretches of nucleotides. It has long been recognized in concrete 
examples that equilibrium statistical-mechanics models are poised to describe the 
binding site occupancy as a function of those parameters, and that these occupancies are 
proxies for transcription rates [1, 16]. Detailed models for the probability
of binding to DNA by TFs have recently been reviewed in [17, 18] and we refer the
interested reader thereto (see also Methods for a concise summary). 

The qualitative point of importance here is that the probability of a TF binding
to DNA is controlled by its so-called chemical potential μ. As illustrated in figure 1,
strong binding sites (with energy much lower than μ) are occupied almost certainly,
while weak sites, with energies much higher than μ, are most frequently empty. The
chemical potential μ increases with the number of copies n of the transcription factor
as log n (see, e.g., [17, 18]). For a single copy, n = 1, the value of the chemical potential
defines an offset F_b, usually called background energy. The reason is that F_b controls
the fraction of TF copies bound to DNA either non-specifically or to the genomic
background. Indeed, let E* denote the minimal binding energy, i.e. the energy of
binding to the consensus sequence of the TF. From the previous relation μ - F_b ∝ log n,
it follows that if F_b ≈ E*, then the threshold defined by the chemical potential μ is
larger than (or equal to) E* for any n ≥ 1. In other words, even a single copy of the
transcription factor would then be sufficient to ensure persistent binding, at least of the
consensus sequence. Conversely, as F_b becomes less than E*, more and more TF copies
n are needed to have μ > E*, i.e. persistent occupancy of at least the strongest binding
sites. A minimal abundance (which depends exponentially on the difference E* - F_b)
is then required to have persistent binding of the strongest sites (supposed to be the
functional ones).

Detailed quantitative information on the behavior of the chemical potential for 
transcription factors of biological interest is scanty. The relation between binding 
affinities and abundances was analyzed in [19] for three coliphage TFs (Mnt, CI and 
Cro) and one bacterial TF (LacR). The result was that the offset F_b is comparable to the
consensus energy E* for those four TFs. This type of relation endows the cell with the
widest possible window to vary the TF copy number and differentially regulate various
sets of genes. It was therefore dubbed "maximum programmability" [19].

Positing F_b ≈ E* to be generally valid seems however too strong a requirement for the
cellular dynamics, as it would make regulation too prone to errors. In fact, as already
noted in [19], the four TFs which were considered are rather special: they are all
repressors, they operate without much combinatorial interaction with other factors and
their expression is tightly controlled. This is not the situation encountered in general.
Namely, combinatorial regulation is much more frequent, especially in eukaryotes, and
a large fraction of genes are activated by TFs to their physiological expression levels.
Specificity then arises not from a single transcription factor but from the synergistic and
cooperative combination of several factors. We then expect that the relation F_b ≈ E*,
found in [19] for four particular TFs, does not have general validity and that a different
relation holds in the majority of cases. Our goal here is to quantify and support this
expectation by analyzing experimental data for a large set of transcription factors.

A good model organism to quantitatively investigate the previous issue is the 
budding yeast S. cerevisiae. Concentration data in the log-growth phase [20] and 
large-scale chromatin immunoprecipitation binding data, as given by [5, 21], are both
available. The intersection of the two data sets leaves us with a set of 63 TFs. The 
difficulty to be overcome is that large-scale experimental data on binding do not directly 
provide affinities. A priori, calorimetric methods [22, 23] might be employed to measure
the strength of the interaction of a TF with its binding sites, but these methods have 
been hard to scale up, and values are typically not available for a given TF. One is thus 
forced to infer affinity matrices in silico, from a list of experimentally detected binding 
sites. The procedures and the limitations of these inferences are recalled in the Methods, 
together with the basics of statistical models for TF-DNA interactions. Two different 
inference methods were employed: the classical maximum likelihood argument by Berg 
and von Hippel [21] and the QPMEME method, recently introduced in [23]. Results 
for the relation between affinity and TF abundance, for both ways of determining the 
binding energies, are presented hereafter. Biological consequences, in particular for the 
control of noise in transcription regulation, are presented in the Discussion. 

2. Results 

Combining the two experimental data sets on abundance [20] and chromatin
immunoprecipitation [5, 21], a set of 63 TFs was identified. Affinity matrices for
those TFs were then inferred as detailed in Methods, using both the classical maximum
likelihood procedure [21] and the QPMEME method [25]. In both cases, the matrices
are a priori determined only up to a scale factor. In the first case, following [21], the
factor was set to one in units of k_B T. In the QPMEME method, the scale factor was
determined as described in the Methods via a self-consistency condition, based on the
experimental information on TF abundances. This condition could be satisfied in 41
out of the 63 cases. In the remaining 22 cases no solution could be found, for reasons
that will be presented in the Discussion.

The matrices derived by the two aforementioned methods agree well in the majority
of the 41 cases where both methods could be employed. A first measure of the agreement
between two energy matrices is whether they give the same best binder at each position,
which is indeed the case for 26 TFs out of 41. These 26 instances include cases where one TF
admits more than one consensus sequence, but where both matrices agree on at least
one consensus binder at each position. In 14 cases the sets of consensus sequences agree
completely. For the remaining 15 TFs, the sets of best binders of the two matrices do not
overlap in at least one position.

A more quantitative comparison, sensitive to the full energy matrix and not just
to the best binder, is to consider the normalized probabilities q_{i,α}, i.e. the probability
that nucleotide α be found at position i of the DNA-TF binding complex. The
probabilities computed using the maximum likelihood procedure [21] or QPMEME [25]
are denoted by q^{BvH}_{i,α} and q^{QP}_{i,α}, respectively. The difference between the two sets of
probabilities is quantified by the symmetric Kullback-Leibler relative entropy:

$$ S(q^{BvH}, q^{QP}) = \frac{1}{2} \sum_{i,\alpha} \left( q^{BvH}_{i,\alpha} - q^{QP}_{i,\alpha} \right) \log \frac{q^{BvH}_{i,\alpha}}{q^{QP}_{i,\alpha}} . \qquad (1) $$

Figure 2 shows the mean Kullback-Leibler relative entropy per base pair for the 41
TFs. Except in a few cases, the average differences per base pair are moderate, on the
order of 0.1 to 0.2. No correlation was detectable between the relative entropies and
the number of observed binding sites employed to infer the affinity matrices, indicating
that the differences between the QPMEME and Berg-von Hippel matrices are bona fide
fluctuations and not due to finite sample effects. Detailed properties of the affinity
matrices computed using the two methods are reported in table 1.
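The comparison between two sets of positional probabilities can be sketched in a few lines. This is only an illustration: the function name, the input layout (one row of four probabilities per position) and the prefactor 1/2 in the symmetrized divergence are assumptions, not details taken from the analysis itself.

```python
import math

def symmetric_kl_per_position(q_a, q_b):
    """Mean symmetric Kullback-Leibler divergence per position.

    q_a, q_b: lists of per-position probability rows over (A, C, G, T).
    All entries must be strictly positive (pseudocounts guarantee this).
    The 1/2 prefactor is an assumed convention for the symmetrized form.
    """
    total = 0.0
    for row_a, row_b in zip(q_a, q_b):
        for pa, pb in zip(row_a, row_b):
            total += 0.5 * (pa - pb) * math.log(pa / pb)
    # Average over positions to obtain a value per base pair.
    return total / len(q_a)
```

Each term (p - q) log(p/q) is non-negative, so the result vanishes only when the two matrices give identical probabilities at every position.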

As a side remark, note that the average discrimination energy per site generally 
decreases with the length of the binding site, indicating a trade-off between these two 
quantities. Figure 3 displays the data for the QPMEME-derived energy matrices; a
similar behavior is found for matrices inferred by maximum likelihood. 

Figure 4 reports the behavior of the background energy F_b, previously defined
as the offset of the chemical potential μ (its value at unit copy number, n = 1). The
result is that the maximum programmability relation F_b ≈ E* proposed in [19] is indeed
peculiar to the three coliphage TFs and the bacterial TF which were considered. A different
behavior is clearly observed in the yeast S. cerevisiae. The background energy F_b is
not comparable to the consensus binding energy E*, but is generally smaller, and the
difference is correlated with the experimentally observed abundance n_obs, as can be
seen in figure 4. In other words, the experimental observations are more in agreement
with the behavior E* - F_b ∝ log n_obs than with the maximum programmability relation
E* - F_b ≈ 0. Note that this holds irrespective of the method (maximum likelihood or
QPMEME) used to estimate the discrimination matrices.
Figure 1. Left panel: A schematic view of the relation between the probability of
binding to DNA for a transcription factor and its chemical potential μ. Strong binding
sites (with energies much lower than the chemical potential) have a high occupation
probability (purple solid line), while the probability to bind decreases rapidly as the
energy increases. Right panel: the relation between the chemical potential and the
abundance n. The background (free) energy F_b is the value of μ for n = 1.
Figure 2. Average Kullback-Leibler distance per base pair between the probability
distributions of binding based on computing discrimination energies by maximum
likelihood arguments [21] or QPMEME [25] (see also Methods).




Figure 3. Average discrimination energy vs length of binding sites. Reported values 
refer to energy matrices computed using the QPMEME method, as described in the 
Methods. 
Table 1. Binding parameters for a set of 63 TFs of the yeast S. cerevisiae, stating
numbers of binding sites used in the analysis (a), experimentally measured protein
abundances (b), maximal ratio of binding energy to chemical potential (cf. equation 4
in Methods) (c), and in units of k_B T the estimates for the chemical potential (d) and
minimal binding energies (consensus), stemming from both BvH (e) and QPMEME
matrices (f), respectively.
As a further test, we compared the experimentally measured TF abundances with
the number of binding sites found in SGD [27] and as reported by Lee et al. [5]. For
the latter, we counted all sites of protein-DNA interaction with associated p-values
< 10^-3 (L1) and < 5·10^-3 (L5). The rationale of this analysis is as follows. If
the maximum programmability ansatz E* - F_b ≈ 0 were satisfied, we should expect
that TF abundances are the main lever in the control of the number of binding sites.
This is the heuristic advantage provided by maximum programmability [19], and a strong
dependence of the number of binding sites on the TF abundance should then be present.
No such behavior is expected for the alternative hypothesis E* - F_b ∝ log n_obs: a sizable
fraction of the TF copies are weakly attached to the DNA, yet the sites are sufficiently
numerous to compete with high-specificity sites. A straightforward regression analysis
gives coefficients of determination R^2 close to zero, viz. 0.0440 for the SGD set and 0.0513
and 0.0900 for the L1 and L5 sets, respectively. Even though the p-values for the three
sets show some statistical significance (0.07, 0.04, 0.006, respectively), the low values
of R^2 indicate that the fraction of the variance explained by the regression is small.
To summarize, the correlation between the number of binding sites and abundance is
slightly significant (as should be expected) but the weakness of the dependence confirms
the previous conclusions.
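For reference, the quantity R^2 quoted above is the fraction of variance explained by a least-squares line. The sketch below shows how it can be computed for paired observations. The function name is hypothetical, and whether the abundances were log-transformed before the regression is not stated in the text, so that choice is left to the caller.

```python
def r_squared(x, y):
    """Coefficient of determination of the least-squares line y = a + b*x.

    x, y: equal-length sequences of numbers (for instance abundances and
    numbers of binding sites; any transformation is applied beforehand).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # For a simple linear regression, R^2 equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)
```

A value such as 0.05 means that about 5% of the variance in the number of binding sites is accounted for by the abundance, however significant the slope may be.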

3. Discussion 

The integration of binding data provided by chromatin immunoprecipitation 
experiments [5, 21] and abundance data from [20] allowed us to extract information
on the relation between binding affinities and abundances of TFs in the log-growth 
phase of the budding yeast S. cerevisiae. The availability of experimental data for other 
conditions would enable a wider perspective, yet two main points have already emerged 
here and are worth being discussed in their biological consequences and significance. 

A first technical point is that, while bioinformatic tools to infer binding free energies 
generally only give these up to a scale factor, we have shown that combining the recent 
method QPMEME [21] and abundance data can provide an estimate of that factor. 
This may be of general methodological interest and useful for future applications. 

For the budding yeast problem considered here, the scale factor could be estimated 
for 41 transcription factors out of 63. For the remaining 22 TFs "individual specificity" is 
not ensured by the observed affinities and abundances, i.e. the binding sites are bound 
even though their energy is larger than the chemical potential. This prevents using 
QPMEME, since the method works in the strong binding regime and supposes that all 
binding sites have energy below the chemical potential. Biologically, having binding 
sites occupied despite their energies being above the chemical potential does not pose 
any contradiction, since additional effects such as other factors and/or regulations of the 
chromosomal structure might crucially contribute to specificity. Indeed, ChIP data (see 
figure 2 in [5]) clearly indicate that many genes of S. cerevisiae are regulated by multiple 
TFs. Furthermore, global chromatin remodeling effects will reduce the effective size of
the genome which is accessible to TFs and increase specificity. Finally, in eukaryotes
it is well known that combinatorial regulation is widespread [3] and its mode of action
hinges on strong cooperative effects among the TFs. The corresponding loci are often 
structured so as to require the synergistic action of various TFs and to remain unbound 
and inactive if only one of them is present. Results of our analysis are in quantitative 
agreement with this picture. 

The second and main result of our work is that experimentally observed abundances
are marginally sufficient to ensure strong and persistent binding of S. cerevisiae TFs to
DNA sites. This is quantified and supported by the results presented in figure 4. More
technically, the background free energy F_b was found to be negative and proportional to
log n_obs, where n_obs is the abundance experimentally measured in [20]. Consequently, the
chemical potential μ remains below the minimal consensus energy E* if n ≪ n_obs. This
implies that a sizable part of the TF copies are "lost in the background" and that the
in vivo observed binding sites are only occupied with low probability if the abundance
is significantly lower than n_obs.

What might superficially appear as a waste in fact ensures an effective noise-filtering
procedure. Fluctuations in the copy number of proteins are unavoidable in
the molecular world and have been experimentally demonstrated in various cases (see, 
e.g., [28]). A few spurious copies of TFs might be present in the cell due to a variety 
of mechanisms, going from delayed degradations, to leaks or lack of tight regulatory 
controls and fluctuations in the expression rates. In an E. coli system, it has recently 
been shown that extrinsic effects, over and above cell-cycle dependent changes in gene 
copy number, acting e.g. through different concentrations of metabolites, ribosomes and 
polymerases, may amount to 35% fluctuations in gene expression levels, and may persist 
over a cell cycle [29]. Intrinsic fluctuations, while persisting for shorter times, are also 
significant, at the 20% level [29]. The relation E* - F_b ∝ log n_obs between the background
affinity energy and the abundance of the transcription factors shown in figure 4 ensures
an effective way to filter out those fluctuations and control mis-regulations.

4. Conclusions 

In conclusion, our results point at the importance of quantitative effects of abundances 
in the regulatory dynamics of the cell. In particular, the abundance-affinity relationship 
E* - F_b ∝ log n_obs demonstrated here is a powerful control lever to ensure global coherent
responses of the cellular regulatory networks despite the noisy nature of their individual 
molecular components. 
5. Methods

Let us consider a TF that diffuses in a cell containing a genomic sequence of length L.
The partition function of specific and non-specific binding to DNA is

$$ Z_b = \sum_{j=1}^{L} e^{-\beta E(S_j)} + L\, e^{-\beta E_{ns}} , \qquad (2) $$

where β is the inverse temperature in units of the Boltzmann constant k_B and S_j is
the subsequence of length l starting at position j in the genomic sequence. In (2) we
have omitted the contribution from the TF freely diffusing in the cytoplasm, assuming that
number to be much smaller than the number of TFs bound. E_ns denotes the energy of
the state where the TF is bound non-specifically to the DNA [12, 13, 14]. From (2), it
follows the definition of the effective background (free) energy as:

$$ F_b = -\beta^{-1} \log Z_b . \qquad (3) $$

A commonly employed expression for the binding energies E(S) is the additive
energy matrix form [30, 31, 32]:

$$ E(S) = \sum_{i=1}^{l} \sum_{\alpha=1}^{4} \varepsilon_{i,\alpha}\, S_{i,\alpha} . \qquad (4) $$

Here, the indicator vector S_{i,α} has entries zero or one depending on which nucleotide α
stands at position i in the sequence S, ε_{i,α} is the free energy contribution of nucleotide
α at position i, and l is the length of the binding domain. Even though exceptions are known
[33], the linear form (4) generally gives a good approximation of the energy profile [33].
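The definitions of the partition function, the background free energy and the additive energy matrix can be made concrete with a short sketch. It is an illustration under simplifying assumptions: a single DNA strand is scanned, the number of windows stands in for the genome length L, energies are in units of k_B T, and all names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"

def binding_energy(site, energy_matrix):
    """Additive energy: sum of the per-position contributions of each base."""
    return sum(energy_matrix[i][NUCLEOTIDES.index(base)]
               for i, base in enumerate(site))

def background_free_energy(genome, energy_matrix, e_ns, beta=1.0):
    """F_b = -(1/beta) log Z_b for a toy genome given as a string.

    Z_b sums Boltzmann weights over every window of length l and adds one
    non-specific state of energy e_ns per window position.
    """
    l = len(energy_matrix)
    n_windows = len(genome) - l + 1
    z_b = sum(
        math.exp(-beta * binding_energy(genome[j:j + l], energy_matrix))
        for j in range(n_windows)
    )
    z_b += n_windows * math.exp(-beta * e_ns)  # non-specific contribution
    return -math.log(z_b) / beta
```

With a flat energy matrix and e_ns = 0, every state has unit weight and F_b reduces to minus the logarithm of the number of states, which is a convenient sanity check.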

Expression (3) of the background energy F_b may be approximated by an average
over a random ensemble (background). The approximation is justified in [19] by a
mapping to the Random Energy Model [35]. As for the choice of the random ensemble,
the simplest background model features independent nucleotides generated with the
average genomic frequencies p_α (α = A, C, G, T), yielding:

$$ \sum_{j=1}^{L} e^{-\beta E(S_j)} \simeq L \prod_{i=1}^{l} \sum_{\alpha} p_\alpha\, e^{-\beta \varepsilon_{i,\alpha}} . \qquad (5) $$

It follows that

$$ F_b \simeq -\beta^{-1} \log \left\{ L \int dE\, \rho(E)\, e^{-\beta E} + L\, e^{-\beta E_{ns}} \right\} , \qquad (6) $$

where ρ(E) = ⟨δ(E - Σ_{i,α} ε_{i,α} S_{i,α})⟩ is the density of states for the random ensemble.
The background density ρ(E) can be computed by a saddle point expansion, where the
first term is Gaussian [25]. Figure 5 compares the empirical energy density (obtained by
the histogram of the energies measured over the whole genome) with the Gaussian and
the first correction. While the former alone would not be appropriate (the empirical
curve is not symmetric), the correspondence with the latter is quite fair. For a few TFs
the match is less good, mainly because discretization effects are more pronounced.
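For the independent-nucleotide background described above, the average Boltzmann weight of a random site factorizes over positions. The following sketch, with hypothetical function names and a toy matrix supplied by the caller, computes that average both from the product formula and by brute-force enumeration, which is feasible only for short sites.

```python
import itertools
import math

def mean_boltzmann_weight(energy_matrix, freqs, beta=1.0):
    """Average of exp(-beta*E) over independent nucleotides (product form)."""
    result = 1.0
    for row in energy_matrix:
        result *= sum(p * math.exp(-beta * eps) for p, eps in zip(freqs, row))
    return result

def mean_boltzmann_weight_brute_force(energy_matrix, freqs, beta=1.0):
    """Same average by explicit enumeration of all 4**l sequences."""
    total = 0.0
    for seq in itertools.product(range(4), repeat=len(energy_matrix)):
        prob = math.prod(freqs[a] for a in seq)
        energy = sum(energy_matrix[i][a] for i, a in enumerate(seq))
        total += prob * math.exp(-beta * energy)
    return total
```

The two functions agree up to rounding, which illustrates why the background term can be evaluated without scanning a genome once the nucleotide frequencies are fixed.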
For a TF present with n copies in the cell, the probability that a sequence S_i be
bound by the TF takes the Fermi-Dirac form (see, e.g., [17, 18, 19] for more details):

$$ P(S_i) = \frac{1}{1 + e^{\beta (E(S_i) - \mu)}} , \qquad (7) $$

with the chemical potential μ implicitly defined by

$$ n = L \int dE\, \left[ \rho(E) + \delta(E - E_{ns}) \right] \frac{1}{1 + e^{\beta (E - \mu)}} . \qquad (8) $$

Equation (8) simply states that the sum over all the binding sites, weighted by the
probability that a TF is bound there, equals the copy number of the TF in the system.
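The implicit definition of the chemical potential can be illustrated by solving its discrete analogue numerically. In this sketch the integral over the density of states is replaced by a sum over an explicit list of site energies (non-specific states can be included as extra entries), and the root is found by bisection. The function names, the bracketing margin and the tolerance are assumptions made for the example.

```python
import math

def fermi_occupancy(energy, mu, beta=1.0):
    """Fermi-Dirac occupation probability of a site with the given energy."""
    return 1.0 / (1.0 + math.exp(beta * (energy - mu)))

def solve_chemical_potential(site_energies, n_copies, beta=1.0, tol=1e-10):
    """Find mu such that the expected number of bound sites equals n_copies.

    The expected number of bound sites increases monotonically with mu,
    so a simple bisection between two generous bounds is sufficient.
    """
    if not 0 < n_copies < len(site_energies):
        raise ValueError("n_copies must lie strictly between 0 and the number of sites")
    lo = min(site_energies) - 50.0 / beta
    hi = max(site_energies) + 50.0 / beta
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        bound = sum(fermi_occupancy(e, mid, beta) for e in site_energies)
        if bound < n_copies:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For two degenerate sites and a single copy, the solution sits exactly at the site energy, where each site is half-filled.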



5.1. Inference of binding properties 

A list of binding sites for a wide set of TFs of S. cerevisiae was downloaded from the SGD
database [27]. The binding sites were extracted from the intergenic regions identified by
chromatin immunoprecipitation experimental data [5] as detailed in [21]. We retained
those TFs for which at least two binding sites and their abundance were available and
processed them as detailed hereafter.

A proxy of the binding properties of the TFs is provided by the log-odds ratios
based on the classical work [21]:

$$ \Delta\varepsilon_{i,\alpha} = \frac{1}{\lambda} \log \frac{1 + n_i^*}{1 + n_{i,\alpha}} , \qquad (9) $$

where n_{i,α} is the number of observations of nucleotide α at the i-th position in the binding
site and n_i^* is the number of observations of the most frequently observed nucleotide in
that position. λ is an unknown scale factor in units of k_B T.

The discrimination energy of a sequence S is defined as the difference between E(S)
and the consensus energy and is hence directly given by Δε_{i,α} in (9). The scale factor λ
must be determined from at least one experimentally measured affinity. In the absence
of experimental data, we have set it to unity (in units of k_B T), which is a fair average
of the values found for a number of prokaryotic examples in [21], and concords with
bioinformatic practice [34].
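A minimal sketch of the log-odds construction follows, assuming the form Δε_{i,α} = (1/λ) log[(1 + n_i^*)/(1 + n_{i,α})] with pseudocount one and λ = 1 by default. The function name and the toy input are hypothetical, and the sites are assumed to be already aligned and of equal length.

```python
import math

NUCLEOTIDES = "ACGT"

def log_odds_matrix(sites, scale=1.0):
    """Discrimination energies from aligned binding sites.

    sites: equal-length strings over ACGT; scale: the factor lambda.
    Returns one row per position, ordered as A, C, G, T, in units of k_B T.
    The most frequently observed base at each position gets energy 0.
    """
    length = len(sites[0])
    matrix = []
    for i in range(length):
        counts = [sum(1 for s in sites if s[i] == base) for base in NUCLEOTIDES]
        best = max(counts)
        matrix.append([math.log((1 + best) / (1 + c)) / scale for c in counts])
    return matrix
```

For example, `log_odds_matrix(["ACG", "ACG", "ATG", "ACT"])` assigns zero energy to A at the first position and log 5 to each base never observed there.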

As a second proxy we have used the recently introduced QPMEME method [25].
This also does not give access to the binding energies as such, but to the ratio of
binding energies to a chemical potential, shifted by the mean free energy of binding of
the corresponding TF:

$$ r(S) = \frac{E(S) - \langle E \rangle}{|\mu|} , \qquad (10) $$

where ⟨·⟩ denotes the average over the random background ensemble defined as before.
The calculation of the matrix ε_{i,α} boils down to a convex optimization problem,
where the width of the background probability distribution is minimized under the
constraints that all sequences in the training set be bound. Note that neither the average
energy ⟨E⟩ = Σ_{i,α} p_α ε_{i,α} nor the chemical potential μ is determined by QPMEME.
Differences between pairs of energies, e.g. discrimination energies, are determined up
to the scale factor |μ|.




Figure 4. Comparison of the relation between the background energy F_b and the
abundance for a set of S. cerevisiae transcription factors. Values of the difference
between the consensus energy E* and the background energy F_b are reported as
squares. Their values shifted by the logarithm of the TF abundance (as measured
experimentally) are reported as circles. Vertical dashed lines correspond to the average
values for the two sets of points. Points have a sizeable scatter but circles are clearly
centered around zero. No relation has been found between the deviation of the points
around zero and the functional role of the corresponding TFs. Long panels: results
for log-odds ratio matrices; short panels: results for QPMEME matrices. Histograms
give better visual access to the distribution widths.

The energy matrices Δε_{i,α} and ε_{i,α} have finite sample errors, which could in principle
be estimated as in [24]. Assuming the sample to be non-biased, these errors decrease
with the number of known binding sites N_bs as 1/√N_bs. A comparison with table 1
reveals that this error is at least on the order of 10% (for those TFs for which about a
hundred binding sites are known), ranging up to 50% (for those with only a few binding
sites known). The chemical potential is determined by the reduced energy matrix and
the observed abundance n_obs, which also has experimental errors and is likely to fluctuate
in vivo. An estimate of the error in the estimation of the chemical potential is thus at
best on the order of 10%. This should nevertheless be sufficient to elucidate statistical
trends, which is our purpose here.

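The probabilities q_{i,α} appearing in the Results denote the probabilities that
nucleotide α is found at position i in the TF-DNA complex. They are computed from
the energy matrices Δε_{i,α} as:

$$ q_{i,\alpha} = \frac{e^{-\beta \Delta\varepsilon_{i,\alpha}}}{\sum_{\alpha'} e^{-\beta \Delta\varepsilon_{i,\alpha'}}} . \qquad (11) $$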
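The conversion from an energy matrix to positional probabilities amounts to a normalized Boltzmann weight at each position. A short sketch, with a hypothetical function name and energies in units of k_B T:

```python
import math

def base_probabilities(delta_eps, beta=1.0):
    """Per-position probabilities proportional to exp(-beta * energy).

    delta_eps: one row of four discrimination energies per position.
    Returns rows of probabilities that each sum to one.
    """
    q = []
    for row in delta_eps:
        weights = [math.exp(-beta * e) for e in row]
        norm = sum(weights)
        q.append([w / norm for w in weights])
    return q
```

A position with energies (0, log 3, log 3, log 3) gives probability 1/2 to the preferred base and 1/6 to each of the others.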
5.2. Computing the background free energy

Definition (2) involves two terms: one describing binding to the genomic background and
the other non-specific electrostatic interactions with the DNA. The latter is crucial to the
target search [12]. As shown in [19], the background contribution cannot be larger than
the non-specific part: the TF would otherwise diffuse in the background random medium
and get slowed down by its local minima. In fact, the two contributions are expected
to be comparable. The division into background and functional binding sites is indeed
dynamical and the former provides the evolutionary reservoir for the latter. Therefore,
evolvability of the regulatory network suggests that the background energy will tend
to be low, compatibly with the aforementioned specificity and kinetic constraints (see
[36, 37] about evolvability).

Figure 5. The density of states (binding energy distribution) for the TF ABF1. Dashed
in black, the curve obtained for a random background. In red, the empirical curve found
by computing the distribution of energies over the genome. The energy scale has been
chosen so as to have the chemical potential μ = -1.



Our estimate for the background energy in (6) is then:

$$ \beta (E^* - F_b) = \log \left[ 2L \int dr\, \rho(r)\, e^{-\beta |\mu| (r - r^*)} \right] . \qquad (12) $$

Here, ρ(r) is the background density of states for the energy matrix ε_{i,α} obtained by
QPMEME and r* is the minimal value of the ratio (10), that is for the energy E* of
the consensus sequence(s) S*. The shift to E* in (12) is introduced just to facilitate
comparison with the results in figure 4.

The quantity β|μ| is not determined by the QPMEME method proper. We
estimate it using the relation (8), the fact that in QPMEME binding energies are only
determined up to the relative chemical potential, and the additional information on
the TF abundance n_obs from [20]. Using the previous arguments on background and
non-specific contributions, we get:

$$ n_{obs} = 2L \int dr\, \rho(r)\, \frac{1}{1 + e^{\beta |\mu| (r + 1)}} , \qquad (13) $$
whence β|μ| is extracted and inserted back into (12) to obtain the value of the
background effective (free) energy F_b. As previously discussed, (13) only has a solution
for 41 cases out of 63 TFs. It is instructive to compare with (8), which has a solution
for every TF. The chemical potential μ then simply acts as a cut-off, so that sites with
energies lower than μ are mostly bound, while sites with higher energies are not, and
the total number of bound TFs equals n_obs. Depending on n_obs, the mostly unbound
sites could or could not include in vivo observed binding sites, i.e. part of the set of
sites from which the maximum likelihood energy matrices have been constructed. In
(13), on the other hand, all the in vivo binding sites must necessarily have binding
energy below the chemical potential, because these are the constraints under which the
QPMEME reduced energy matrix ε is determined. Hence, all sites for which the reduced
QPMEME energy is below the threshold -1 will be at least half-filled. Each
of these is actually present in the genome with some probability, which leads to a total
expected number of at least half-filled sites. Therefore, (13) cannot be solved if n_obs is
low enough, because the right-hand side has a lower bound. This happens in about one
third of the cases at hand.
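The existence of a lower bound can be seen in a discrete toy version of the self-consistency condition, where the integral over the density of states is replaced by a sum over a list of reduced energies r, one per genomic position. The sketch below scans the scale x = β|μ| for a crossing and refines it by bisection, returning None when no crossing is found. The function names, the scan range and the grid are assumptions made for illustration, not the procedure actually used.

```python
import math

def expected_bound(reduced_energies, x):
    """Discrete right-hand side: 2 * sum over sites of 1/(1 + exp(x*(r+1))).

    x stands for beta*|mu|; the factor 2 accounts for the non-specific
    contribution, taken to be comparable to the background one.
    """
    total = 0.0
    for r in reduced_energies:
        z = x * (r + 1.0)
        # Guard against overflow: the occupancy is effectively 0 for large z.
        total += 0.0 if z > 700.0 else 1.0 / (1.0 + math.exp(z))
    return 2.0 * total

def solve_scale(reduced_energies, n_obs, x_max=200.0, steps=2000, tol=1e-9):
    """Return x = beta*|mu| matching n_obs, or None if no crossing is found."""
    prev_x = 0.0
    prev_val = expected_bound(reduced_energies, 0.0) - n_obs
    for k in range(1, steps + 1):
        x = x_max * k / steps
        val = expected_bound(reduced_energies, x) - n_obs
        if prev_val * val <= 0.0:
            lo, hi = prev_x, x
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                mid_val = expected_bound(reduced_energies, mid) - n_obs
                if prev_val * mid_val > 0.0:
                    lo = mid
                else:
                    hi = mid
            return 0.5 * (lo + hi)
        prev_x, prev_val = x, val
    return None
```

In a toy set with two sites at r = -2 and 98 sites at r = +1, the two strong sites are always at least half-filled, so the sum never drops below 2 and an abundance of 1 cannot be matched, whereas an abundance of 50 can.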



5.3. Maximal programmability

In the simplest scenario where the major contribution in (8) stems from energies where
the Fermi-Dirac weight can be approximated by the Boltzmann factor, one can invert
(8) to obtain

$$ \mu \simeq \beta^{-1} \log n + F_b . \qquad (14) $$

The occupation probability of a site t then reads P_t = 1/(1 + n_t/n), where the threshold
concentration n_t is e^{β(E_t - F_b)}. The minimal copy number required for strong binding (to
the consensus) must then be at least e^{β(E* - F_b)}.
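The threshold behavior, and its noise-filtering consequence, can be illustrated numerically. The sketch assumes the Boltzmann regime in which the chemical potential grows as log n, so that the occupancy of a site is n/(n + n_t). Function names and numerical values are hypothetical.

```python
import math

def threshold_copy_number(e_site, f_b, beta=1.0):
    """Copy number n_t at which a site of energy e_site is half-occupied."""
    return math.exp(beta * (e_site - f_b))

def occupancy(n_copies, e_site, f_b, beta=1.0):
    """Occupation probability P_t = n / (n + n_t) in the Boltzmann regime."""
    n_t = threshold_copy_number(e_site, f_b, beta)
    return n_copies / (n_copies + n_t)
```

With E* - F_b = log 1000 (an arbitrary illustrative value), the consensus site is half-occupied at 1000 copies, whereas 10 spurious copies give an occupancy below 1%.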

Maximal programmability [19] amounts to positing the lowest (unity) threshold.
The approximate equality F_b ≈ E* should then hold. One consequence, which motivates
the term, is that the consensus sequence is then half-bound if there is just a single copy
of the TF present in the cell. Different regulatory elements can then have their thresholds set,
or programmed, from one, if their sequences are the consensus sequence, and upwards,
independently of a feedback induced by the actual TF copy number.